--- title: "[old portfolio] Kaggle Xgboost" date: 2020-01-01 00:00:00 +0900 categories: jekyll update --- FutureSales

Analysis of sales data and predictions of monthly sales

By Huijun Park

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
os.getcwd()
Out[1]:
'E:\\kaggle\\competitive-data-science-predict-future-sales'
In [ ]:
os.chdir('./competitive-data-science-predict-future-sales')

Import description files

We are not going to analyze individual items, for the following reasons:

#1: I cannot read Russian

#2: I do not have sufficient domain knowledge of electronics retail

In [45]:
cat = pd.read_csv('item_categories.csv')
items = pd.read_csv('items.csv')
shops = pd.read_csv('shops.csv')

ID numbers for each category of items

In [48]:
cat.tail()
Out[48]:
item_category_name item_category_id
79 Служебные 79
80 Служебные - Билеты 80
81 Чистые носители (шпиль) 81
82 Чистые носители (штучные) 82
83 Элементы питания 83

ID number and category for each item

In [49]:
items.tail()
Out[49]:
item_name item_id item_category_id
22165 Ядерный титбит 2 [PC, Цифровая версия] 22165 31
22166 Язык запросов 1С:Предприятия [Цифровая версия] 22166 54
22167 Язык запросов 1С:Предприятия 8 (+CD). Хрустале... 22167 49
22168 Яйцо для Little Inu 22168 62
22169 Яйцо дракона (Игра престолов) 22169 69

ID numbers for each shop

In [50]:
shops.tail()
Out[50]:
shop_name shop_id
55 Цифровой склад 1С-Онлайн 55
56 Чехов ТРЦ "Карнавал" 56
57 Якутск Орджоникидзе, 56 57
58 Якутск ТЦ "Центральный" 58
59 Ярославль ТЦ "Альтаир" 59

There are 60 shops, 84 categories, and 22170 items

Import train data

In [51]:
train = pd.read_csv('sales_train_v2.csv')
In [71]:
train = train.dropna(how='any') # there seems to be no missing data, so this drops nothing
In [76]:
train.head()
Out[76]:
date date_block_num shop_id item_id item_price item_cnt_day
0 02.01.2013 0 59 22154 999.00 1.0
1 03.01.2013 0 25 2552 899.00 1.0
2 05.01.2013 0 25 2552 899.00 -1.0
3 06.01.2013 0 25 2554 1709.05 1.0
4 15.01.2013 0 25 2555 1099.00 1.0

There seem to be refunds too (rows with negative item_cnt_day)

In [113]:
train[(train.date=='05.01.2013') & (train.item_id==2552)]
Out[113]:
date date_block_num shop_id item_id item_price item_cnt_day
2 05.01.2013 0 25 2552 899.0 -1.0

For later use, the total balance including refunds is calculated

In [117]:
train['item_balance']=train['item_price']*train['item_cnt_day']
train.head()
Out[117]:
date date_block_num shop_id item_id item_price item_cnt_day item_balance
0 02.01.2013 0 59 22154 999.00 1.0 999.00
1 03.01.2013 0 25 2552 899.00 1.0 899.00
2 05.01.2013 0 25 2552 899.00 -1.0 -899.00
3 06.01.2013 0 25 2554 1709.05 1.0 1709.05
4 15.01.2013 0 25 2555 1099.00 1.0 1099.00

date_block_num seems to indicate the month (0 = January 2013)

In [122]:
train.loc[:,['date_block_num','item_cnt_day','item_balance']].groupby(['date_block_num']).sum().head()
Out[122]:
item_cnt_day item_balance
date_block_num
0 131479.0 9.194709e+07
1 128090.0 9.066571e+07
2 147142.0 1.049327e+08
3 107190.0 6.915429e+07
4 106970.0 6.506531e+07
In [178]:
plt.rcParams['font.size'] = 15
fig, ax1=plt.subplots(figsize=(12,6))
ax2=ax1.twinx()
color = (0,0,1)
train.loc[:,['date_block_num','item_cnt_day']].groupby(['date_block_num']).sum().plot(fig=fig,ax=ax1,color=color);
ax1.set_ylabel('Sold Items',color=color)
ax1.set_xlabel('Months',color=(0,0,0))
ax1.tick_params(axis='y', labelcolor=color)
ax1.set_ylim((0,2e5))
color = 'tab:red'
train.loc[:,['date_block_num','item_balance']].groupby(['date_block_num']).sum().plot(fig=fig,ax=ax2,color=color);
ax2.set_ylabel('Balance',color=color)
ax2.tick_params(axis='y', labelcolor=color)
ax2.set_ylim((0,2.5e8))
ax1.legend().set_visible(False)
ax2.legend().set_visible(False)

The peak months are holiday seasons. As it turns out, people also tend to buy more expensive goods during the holiday season, which can be inferred from the ratio $\frac{\text{earnings}}{\text{items}}$. The sales data are highly periodic
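To back up that inference, the per-month average price is simply the total balance divided by the items sold. A minimal sketch on a toy frame standing in for `train` (the real call is the same groupby on the full data):

```python
import pandas as pd

# Toy stand-in for `train`; the real frame is loaded from sales_train_v2.csv.
toy = pd.DataFrame({
    'date_block_num': [0, 0, 1, 1],
    'item_cnt_day':   [1.0, 2.0, 1.0, 1.0],
    'item_balance':   [100.0, 400.0, 300.0, 100.0],
})

monthly = toy.groupby('date_block_num')[['item_cnt_day', 'item_balance']].sum()
# Average revenue per sold item for each month; spikes of this ratio in
# December months would indicate customers buying more expensive goods.
avg_price = monthly['item_balance'] / monthly['item_cnt_day']
print(avg_price.tolist())
```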

In [174]:
train[(train.date_block_num==11) | (train.date_block_num==23)]
Out[174]:
date date_block_num shop_id item_id item_price item_cnt_day item_balance
1124316 04.12.2013 11 25 17769 199.0 1.0 199.0
1124317 15.12.2013 11 25 18016 500.0 1.0 500.0
1124318 22.12.2013 11 25 17763 399.0 1.0 399.0
1124319 31.12.2013 11 25 17760 3250.0 1.0 3250.0
1124320 18.12.2013 11 25 17763 398.5 1.0 398.5
... ... ... ... ... ... ... ...
2323418 11.12.2014 23 25 5037 2599.0 1.0 2599.0
2323419 28.12.2014 23 25 5037 1999.0 2.0 3998.0
2323420 03.12.2014 23 25 5038 2999.0 1.0 2999.0
2323421 06.12.2014 23 25 5033 1199.0 1.0 1199.0
2323422 31.12.2014 23 25 5037 1999.0 1.0 1999.0

274032 rows × 7 columns

In [188]:
cnt=train.loc[:,['date_block_num','item_cnt_day']].groupby(['date_block_num']).sum().to_numpy().reshape(-1)
bln=train.loc[:,['date_block_num','item_balance']].groupby(['date_block_num']).sum().to_numpy().reshape(-1)

Analyze the power spectral density to see the frequency content of the time-series signals

In [237]:
def PSD(X):
    return np.square(np.abs(np.fft.fft(X)))
In [333]:
cnt.shape
Out[333]:
(34,)
In [380]:
fig, ax = plt.subplots(1,2,figsize=(12,4));
f = np.fft.fftfreq(len(cnt))[:int(len(cnt)/2)] # FFT bin frequencies k/N, in cycles per month
ax[0].plot(f,PSD(cnt)[0:17]);
ax[0].set_ylim(0,2e11);
ax[0].set_title('Item count')
ax[0].set_xlabel('frequency (Month)')
ax[0].axvline(x=1/12,linewidth=4, color='r')
fig.text(0.15, .70, r'$T = 1$ year', {'color': 'r', 'fontsize': 20})
ax[1].plot(f,PSD(bln)[0:17]);
ax[1].set_ylim(0,3e17);
ax[1].set_title('Balance')
ax[1].set_xlabel('frequency (Month)')
ax[1].axvline(x=1/12,linewidth=4, color='r')
plt.tight_layout()
fig.suptitle('Power Spectral Density',position=(0.5,1.05),fontsize=20);

The autocorrelation function (ACF) can be calculated via the Wiener-Khinchin theorem

$ACF(X)=\mathfrak{F}^{-1}(PSD(X))$

https://www.itp.tu-berlin.de/fileadmin/a3233/grk/pototskyLectures2012/pototsky_lectures_part1.pdf

In [448]:
def ACF(X):
    return np.fft.fftshift(np.fft.ifft(np.square(np.abs(np.fft.fft(X)))))

def nACF(X): #Normalized Autocorrelation
    jp_=np.square(np.mean(X))
    pj_=np.mean(np.square(X))
    return (ACF(X)/len(X)-jp_)/(pj_-jp_)
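As a quick sanity check (not in the original notebook), the FFT-based ACF equals the direct circular autocorrelation; `ACF` is restated here so the snippet is self-contained:

```python
import numpy as np

def ACF(X):
    # Wiener-Khinchin: inverse FFT of the power spectral density.
    return np.fft.fftshift(np.fft.ifft(np.square(np.abs(np.fft.fft(X)))))

x = np.array([1.0, 2.0, 0.0, -1.0])
# Direct circular autocorrelation r[k] = sum_n x[n] * x[(n+k) mod N].
direct = np.array([sum(x[n] * x[(n + k) % len(x)] for n in range(len(x)))
                   for k in range(len(x))])
fft_based = np.fft.ifftshift(ACF(x)).real  # undo the fftshift to compare
assert np.allclose(fft_based, direct)
```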
In [449]:
fig, ax = plt.subplots(1,2,figsize=(12,4));
ax[0].plot(nACF(cnt)[17:]);
ax[0].set_title('Item count')
ax[0].set_xlabel(r'$\tau$'+' (Month)')
ax[0].axvline(x=12,linewidth=4, color='r')
fig.text(0.25, .70, r'$\tau = 1$ year', {'color': 'r', 'fontsize': 20})
ax[1].plot(nACF(bln)[17:]);
ax[1].set_title('Balance')
ax[1].set_xlabel(r'$\tau$'+' (Month)')
ax[1].axvline(x=12,linewidth=4, color='r')
plt.tight_layout()
fig.suptitle('Autocorrelation',position=(0.5,1.05),fontsize=20);

We can clearly see the strong 12-month frequency response, indicating seasonal dependency

In conclusion, we need a model that accounts for both the overall rise/decline of sales and the seasonal shift

Let's see our test set

In [451]:
test = pd.read_csv('test.csv')
In [464]:
test.head(6)
Out[464]:
ID shop_id item_id
0 0 5 5037
1 1 5 5320
2 2 5 5233
3 3 5 5232
4 4 5 5268
5 5 5 5039

At first glance, the training data for each ID look quite sparse, with mostly zero or one sale per month

In [465]:
train[(train.shop_id==5) & (train.item_id==5037)]
Out[465]:
date date_block_num shop_id item_id item_price item_cnt_day item_balance
1953995 21.09.2014 20 5 5037 2599.0 1.0 2599.0
2150561 29.11.2014 22 5 5037 2599.0 1.0 2599.0
2288630 28.12.2014 23 5 5037 1999.0 1.0 1999.0
2288631 20.12.2014 23 5 5037 1999.0 1.0 1999.0
2335446 02.01.2015 24 5 5037 1999.0 1.0 1999.0
2335447 07.01.2015 24 5 5037 1999.0 1.0 1999.0
2618926 29.05.2015 28 5 5037 1299.0 1.0 1299.0
2704068 28.06.2015 29 5 5037 1499.0 1.0 1499.0
2719247 05.07.2015 30 5 5037 1499.0 1.0 1499.0
2810661 14.08.2015 31 5 5037 1499.0 1.0 1499.0
2810662 20.08.2015 31 5 5037 749.5 1.0 749.5
2810663 31.08.2015 31 5 5037 749.0 1.0 749.0
2860998 05.09.2015 32 5 5037 749.5 1.0 749.5
In [676]:
plt.scatter(train[(train.shop_id==5) & (train.item_id==5037)]['date_block_num'],train[(train.shop_id==5) & (train.item_id==5037)]['item_cnt_day']);
plt.xlabel('month')
plt.ylabel('items sold')
plt.title('shop 5, item 5037');

Reorganize the data and sum the sales for each month

In [491]:
grouped=train.groupby(['shop_id','item_id','date_block_num'])
grouped.first()
Out[491]:
date item_price item_cnt_day item_balance
shop_id item_id date_block_num
0 30 1 21.02.2013 265.0 2.0 530.0
31 1 15.02.2013 434.0 3.0 1302.0
32 0 03.01.2013 221.0 2.0 442.0
1 25.02.2013 221.0 1.0 221.0
33 0 03.01.2013 347.0 1.0 347.0
... ... ... ... ... ... ...
59 22164 27 30.04.2015 699.0 1.0 699.0
30 21.07.2015 699.0 1.0 699.0
22167 9 25.10.2013 299.0 1.0 299.0
11 14.12.2013 299.0 1.0 299.0
17 14.06.2014 299.0 1.0 299.0

1609124 rows × 4 columns

In [497]:
grouped.get_group((0,30,1))['item_cnt_day'].sum()
Out[497]:
31.0
In [939]:
groupsum=grouped.sum()
In [505]:
groupsum.groupby('date_block_num').tail(100)
Out[505]:
item_price item_cnt_day item_balance
shop_id item_id date_block_num
59 15854 33 859.0 1.0 859.0
15866 33 1299.0 1.0 1299.0
15919 28 1449.0 1.0 1449.0
16003 33 1499.0 1.0 1499.0
16038 28 299.0 1.0 299.0
... ... ... ... ...
22164 27 1398.0 2.0 1398.0
30 699.0 1.0 699.0
22167 9 299.0 1.0 299.0
11 598.0 2.0 598.0
17 299.0 1.0 299.0

3400 rows × 3 columns

In [512]:
ledger=groupsum['item_cnt_day']
In [521]:
train[(train.shop_id==0) & (train.item_id==30)]
Out[521]:
date date_block_num shop_id item_id item_price item_cnt_day item_balance
173344 21.02.2013 1 0 30 265.0 2.0 530.0
173345 20.02.2013 1 0 30 265.0 2.0 530.0
173346 18.02.2013 1 0 30 265.0 4.0 1060.0
173347 17.02.2013 1 0 30 265.0 4.0 1060.0
173348 16.02.2013 1 0 30 265.0 9.0 2385.0
173349 15.02.2013 1 0 30 265.0 2.0 530.0
173361 22.02.2013 1 0 30 265.0 2.0 530.0
173402 23.02.2013 1 0 30 265.0 3.0 795.0
173441 26.02.2013 1 0 30 265.0 3.0 795.0

Let us observe how each shop performs each month

In [540]:
shop_performance=np.zeros((60,34))
for s in range(60):
    for i in range(34):
        shop_performance[s][i]=train[(train.shop_id==s) & (train.date_block_num==i)].sum()['item_cnt_day']
        if s%10==0 and i==0:
            print("{0}/60".format(s))
0/60
10/60
20/60
30/60
40/60
50/60
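The double loop above rescans `train` 60 × 34 times; a single `pivot_table` produces the same shop-by-month matrix in one pass. A sketch on a toy frame (on the real data the equivalent call would be `train.pivot_table(index='shop_id', columns='date_block_num', values='item_cnt_day', aggfunc='sum', fill_value=0)`):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `train` with two shops and two months.
toy = pd.DataFrame({
    'shop_id':        [0, 0, 1, 1, 1],
    'date_block_num': [0, 1, 0, 0, 1],
    'item_cnt_day':   [1.0, 2.0, 3.0, 1.0, 5.0],
})

# Rows: shops, columns: months, values: total items sold.
perf = toy.pivot_table(index='shop_id', columns='date_block_num',
                       values='item_cnt_day', aggfunc='sum', fill_value=0)
print(perf.to_numpy())
```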
In [560]:
fig, ax = plt.subplots(10,6,figsize=(50,30));
iter=0
for ax1 in ax:
    for axes in ax1:
        axes.plot(shop_performance[iter])
        iter+=1
plt.tight_layout()
In [596]:
from sklearn.linear_model import LinearRegression
In [643]:
def regfunc(X):
    # fit a linear trend over time; returns [slope, intercept]
    reg=LinearRegression().fit(np.arange(len(X)).reshape(-1,1), X)
    return [reg.coef_[0],reg.intercept_]

Assumption: the signal can be decomposed into a linear trend + $\epsilon$.

The linear trend can be fitted by regression.

The $\epsilon$ can be estimated by averaging the recurring 12-month pattern.
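A minimal numeric sketch of this assumption on a synthetic 34-month signal (the constants here are arbitrary, chosen only for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic signal: linear trend (slope 2, intercept 5) plus a
# 12-month seasonal component of amplitude 10.
months = np.arange(34)
signal = 2.0 * months + 5.0 + 10 * np.sin(2 * np.pi * months / 12)

# Fit the linear trend.
reg = LinearRegression().fit(months.reshape(-1, 1), signal)
trend = reg.coef_[0] * months + reg.intercept_
eps = signal - trend  # residual carrying the seasonal pattern

# Predict month 34: extrapolate the trend and add the average residual
# of the same calendar month from the two prior years (lags 12 and 24).
l = len(signal)
offset = np.mean([eps[l - 12], eps[l - 24]])
prediction = reg.coef_[0] * l + reg.intercept_ + offset
```

The fitted slope is close to the true slope of 2, and the prediction lands near the true next value; the small bias comes from the seasonal component not covering whole periods within 34 months.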

In [681]:
fig, ax = plt.subplots(1,3,figsize=(12,4));
months=len(shop_performance[7])
ax[0].plot(shop_performance[7])
ax[0].set_ylim(0)
ax[0].set_title('linear regression')
a,b=regfunc(shop_performance[7])
trend=a*np.arange(months)+b
ax[0].plot(trend)
ax[1].plot(shop_performance[7]-trend)
ax[1].set_title('$\epsilon$');
ax[2].plot((shop_performance[7]-trend)[:12])
ax[2].plot((shop_performance[7]-trend)[12:24])
ax[2].plot((shop_performance[7]-trend)[24:])
ax[2].set_title('$\epsilon$-yearly');
fig.suptitle('Monthly performance of shop 7',position=(0.5,1.05), fontsize=20)
plt.tight_layout()
In [591]:
deadshop=[]
iter=0
for sales in shop_performance:
    if sales[33]==0:
        print(iter,end=' ')
        deadshop.append(iter)
    iter+=1
0 1 8 11 13 17 23 27 29 30 32 33 40 43 51 54 

A number of shops seem to have closed down. Their expected sales can be manually set to 0. Some shops only open for one month per year.

In [1107]:
def pred_nextmonth(X):
    if X[-1]==0:
        return 0 # dead shop sells zero
    l=len(X)
    a,b=regfunc(X)
    e=X-(a*np.arange(l)+b) # residual after removing the linear trend
    n=[]
    for it in range(int((l-1)/12)):
        n.append(e[l-12*it-12]) # same calendar month in prior years
    m=np.mean(n) if len(n)>0 else 0
    # linear trend + epsilon (mean residual of the same month from prior years)
    return np.max([a*l+b+m,0])
def month_append(X):
    return np.append(X,pred_nextmonth(X))
In [789]:
A=month_append(shop_performance[7])
for i in range(12):
    A=month_append(A)
fig, ax = plt.subplots(1,3,figsize=(12,4));
months=len(A)
ax[0].plot(A)
ax[0].set_ylim(0)
ax[0].set_title('linear regression')
a,b=regfunc(A)
trend=a*np.arange(months)+b
ax[0].plot(trend)
ax[1].plot(A-trend)
ax[1].set_title('$\epsilon$');
ax[2].plot((A-trend)[:12])
ax[2].plot((A-trend)[12:24])
ax[2].plot((A-trend)[24:36])
ax[2].plot((A-trend)[36:])
ax[2].set_title('$\epsilon$-yearly');
fig.suptitle('Next yearly predictions of shop 7',position=(0.5,1.05), fontsize=20)
plt.tight_layout()
In [1109]:
shop_prediction=[]
for perf in shop_performance:
    for i in range(12):
        perf=month_append(perf)
    shop_prediction.append(perf)
In [1110]:
fig, ax = plt.subplots(10,6,figsize=(50,30));
iter=0
for ax1 in ax:
    for axes in ax1:
        axes.plot(shop_prediction[iter])
        axes.axvspan(34, len(shop_prediction[iter])-1, facecolor='r', alpha=0.2)
        iter+=1
fig.suptitle('predictions of sales for the next year for each shop',position=(0.5,1.01),fontsize=30)
plt.tight_layout()

The red parts are predictions

Let us observe whether the same trend applies to item categories

In [569]:
cat_list=items['item_category_id'].to_numpy()
In [573]:
train['item_categories']=cat_list[train['item_id']]
train.head()
Out[573]:
date date_block_num shop_id item_id item_price item_cnt_day item_balance item_categories
0 02.01.2013 0 59 22154 999.00 1.0 999.00 37
1 03.01.2013 0 25 2552 899.00 1.0 899.00 58
2 05.01.2013 0 25 2552 899.00 -1.0 -899.00 58
3 06.01.2013 0 25 2554 1709.05 1.0 1709.05 58
4 15.01.2013 0 25 2555 1099.00 1.0 1099.00 56
In [575]:
cat_performance=np.zeros((84,34))
for s in range(84):
    for i in range(34):
        cat_performance[s][i]=train[(train.item_categories==s) & (train.date_block_num==i)].sum()['item_cnt_day']
        if s%10==0 and i==0:
            print("{0}/84".format(s))
0/84
10/84
20/84
30/84
40/84
50/84
60/84
70/84
80/84
In [577]:
fig, ax = plt.subplots(12,7,figsize=(50,40));
iter=0
for ax1 in ax:
    for axes in ax1:
        axes.plot(cat_performance[iter])
        iter+=1
plt.tight_layout()

Some categories have negligible or no sales

In [584]:
deadcat=[]
iter=0
for sales in cat_performance:
    if sales[33]==0:
        print(iter,end=' ')
        deadcat.append(iter)
    iter+=1
0 1 4 10 13 17 18 32 39 46 48 50 51 52 53 59 66 68 80 81 82 
In [812]:
cat_prediction=[]
for perf in cat_performance:
    for i in range(12):
        perf=month_append(perf)
    cat_prediction.append(perf)
In [1054]:
fig, ax = plt.subplots(10,6,figsize=(50,30));
iter=0
for ax1 in ax:
    for axes in ax1:
        axes.plot(cat_prediction[iter])
        axes.axvspan(34, len(cat_prediction[iter])-1, facecolor='r', alpha=0.2)
        iter+=1
fig.suptitle('predictions for the next year for each category of items',position=(0.5,1.01),fontsize=30)
plt.tight_layout()
In [943]:
groupsum=groupsum.reset_index()
In [883]:
npmatrix=groupsum[['shop_id','item_id','date_block_num']].to_numpy()
In [884]:
npmatrix
Out[884]:
array([[    0,    30,     1],
       [    0,    31,     1],
       [    0,    32,     0],
       ...,
       [   59, 22167,     9],
       [   59, 22167,    11],
       [   59, 22167,    17]], dtype=int64)
In [882]:
items['item_category_id']
Out[882]:
0        40
1        76
2        40
3        40
4        40
         ..
22165    31
22166    54
22167    49
22168    62
22169    69
Name: item_category_id, Length: 22170, dtype: int64

Calculate the next-month sales prediction of the corresponding shop and category for each entry.

In other words, add forecast features for how well the respective shop/category will sell. This idea came from the "alpha" constant of CAPM https://en.wikipedia.org/wiki/Capital_asset_pricing_model, which accounts for the movement of the market as a whole

In [904]:
from tqdm import tqdm, tqdm_notebook
In [914]:
newmatrix=[]
iterationcounter=0
for i in tqdm_notebook(npmatrix):
    i=np.append(i, month_append(shop_performance[i[0]][:i[2]+1])[-1]) #shop sales prediction of next month
    category=items['item_category_id'][i[1]]
    i=np.append(i, month_append(cat_performance[category][:int(i[2]+1)])[-1]) #category sales prediction of next month
    newmatrix.append(i)
    if (iterationcounter%50000)==0:
        print('{0:07d} / {1:07d}\r'.format(iterationcounter,len(npmatrix)))
    iterationcounter+=1
0000000 / 1609124
...
1600000 / 1609124

In [915]:
pd.DataFrame(newmatrix).to_csv("newmatrix.csv")
In [927]:
newmatrix=np.asarray(newmatrix)
In [940]:
groupsum['shop_pred']=newmatrix[:,3]
groupsum['cat_pred']=newmatrix[:,4]
In [997]:
entry=[]
a=groupsum['shop_id']
b=groupsum['date_block_num']
c=groupsum['item_id']
d=len(groupsum)
for i in range(d-1):
    #print(a[i],c[i],b[i]+1)
    if a[i+1]==a[i] and b[i+1]==b[i]+1 and c[i+1]==c[i]:
        #entry.append(groupsum[(groupsum.shop_id==a[i]) & (groupsum.item_id==c[i]) & (groupsum.date_block_num==(b[i]+1))].sum()['item_cnt_day'])
        entry.append(groupsum.loc[i+1,'item_cnt_day'])
    else:
        entry.append(0)
    if (i%50000)==0:
        print('{0:07d} / {1:07d}\r'.format(i,d))
#    print(entry)
entry.append(0)
groupsum['true_nxt_month']=entry
0000000 / 1609124
...
1600000 / 1609124
In [1000]:
groupsum.to_csv('traindata.csv')
In [1004]:
groupsum=groupsum.drop('item_price',axis=1)
In [1006]:
groupsum.tail(10)
Out[1006]:
shop_id item_id date_block_num item_cnt_day item_balance shop_pred cat_pred true_nxt_month
1609114 59 22162 27 1.0 349.0 778.477833 6248.682266 1.0
1609115 59 22162 28 1.0 349.0 903.473892 6994.605911 0.0
1609116 59 22162 31 1.0 349.0 1054.433651 4327.526393 0.0
1609117 59 22164 25 2.0 1498.0 1267.272308 5243.923077 1.0
1609118 59 22164 26 1.0 749.0 825.230769 5863.752747 2.0
1609119 59 22164 27 2.0 1398.0 778.477833 5267.603448 0.0
1609120 59 22164 30 1.0 699.0 1093.559677 5499.426613 0.0
1609121 59 22167 9 1.0 299.0 1792.933333 699.800000 0.0
1609122 59 22167 11 2.0 598.0 2036.954545 729.818182 0.0
1609123 59 22167 17 1.0 299.0 1334.975232 640.826625 0.0

Organize the dataset so it can be fed to the regressor

In [1053]:
traingroup=groupsum[groupsum['date_block_num']<groupsum['date_block_num'].unique().max()] ##cannot train with the last month because we don't know the ground truth for the future
print(len(traingroup)/len(groupsum)) #about 2% of data dropped
traingroup.tail()
0.9804048662502082
Out[1053]:
shop_id item_id date_block_num item_cnt_day item_balance shop_pred cat_pred true_nxt_month
1609119 59 22164 27 2.0 1398.0 778.477833 5267.603448 0.0
1609120 59 22164 30 1.0 699.0 1093.559677 5499.426613 0.0
1609121 59 22167 9 1.0 299.0 1792.933333 699.800000 0.0
1609122 59 22167 11 2.0 598.0 2036.954545 729.818182 0.0
1609123 59 22167 17 1.0 299.0 1334.975232 640.826625 0.0
In [1088]:
lastmonth=groupsum[groupsum['date_block_num']==groupsum['date_block_num'].unique().max()]
lastmonth.tail()
Out[1088]:
shop_id item_id date_block_num item_cnt_day item_balance shop_pred cat_pred true_nxt_month
1608998 59 22087 33 6.0 714.0 1229.143392 110.629565 0.0
1609030 59 22088 33 2.0 238.0 1229.143392 110.629565 0.0
1609047 59 22091 33 1.0 179.0 1229.143392 110.629565 0.0
1609073 59 22100 33 1.0 629.0 1229.143392 307.787624 0.0
1609076 59 22102 33 1.0 1250.0 1229.143392 307.787624 0.0
In [1056]:
tlen=len(test)
test['date_block_num']=np.ones(tlen)*groupsum['date_block_num'].unique().max()
In [1095]:
traingroup=traingroup.drop('item_balance',axis=1)
traingroup.tail()
Out[1095]:
shop_id item_id date_block_num item_cnt_day shop_pred cat_pred true_nxt_month
1609119 59 22164 27 2.0 778.477833 5267.603448 0.0
1609120 59 22164 30 1.0 1093.559677 5499.426613 0.0
1609121 59 22167 9 1.0 1792.933333 699.800000 0.0
1609122 59 22167 11 2.0 2036.954545 729.818182 0.0
1609123 59 22167 17 1.0 1334.975232 640.826625 0.0
In [1075]:
a=test.iloc[:,1:3].to_numpy()[:,0]
b=test.iloc[:,1:3].to_numpy()[:,1]
In [1089]:
countvector=np.zeros(tlen)
for i in range(tlen):
    countvector[i]=lastmonth[(lastmonth['shop_id']==a[i]) & (lastmonth['item_id']==b[i])].item_cnt_day.sum()
    if (i%50000)==0:
        print('{0:07d} / {1:07d}\r'.format(i,tlen))
0000000 / 0214200
0050000 / 0214200
0100000 / 0214200
0150000 / 0214200
0200000 / 0214200
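The per-row lookup loop above can also be expressed as a left merge, which works because `lastmonth` has at most one row per shop/item pair. A sketch on toy frames standing in for `test` and `lastmonth`:

```python
import pandas as pd

# Toy stand-ins; in the notebook the equivalent would be
# test.merge(lastmonth[['shop_id', 'item_id', 'item_cnt_day']],
#            on=['shop_id', 'item_id'], how='left')
test_toy = pd.DataFrame({'shop_id': [5, 5], 'item_id': [1, 2]})
last_toy = pd.DataFrame({'shop_id': [5], 'item_id': [1], 'item_cnt_day': [3.0]})

merged = test_toy.merge(last_toy, on=['shop_id', 'item_id'], how='left')
merged['item_cnt_day'] = merged['item_cnt_day'].fillna(0)  # unseen pairs sold 0
```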
In [1091]:
test['item_cnt_day']=countvector
test.tail()
Out[1091]:
ID shop_id item_id date_block_num item_cnt_day
214195 214195 45 18454 33.0 1.0
214196 214196 45 16188 33.0 0.0
214197 214197 45 15757 33.0 0.0
214198 214198 45 19648 33.0 0.0
214199 214199 45 969 33.0 0.0
In [1128]:
shop_p=[]
for perf in shop_performance:
    perf=month_append(perf)
    shop_p.append(perf[-1])
cat_p=[]
for perf in cat_performance:
    cat_p.append(pred_nextmonth(perf))
In [1141]:
shopv=[]
catv=[]
catindex=items['item_category_id'].to_numpy()
for shops in test.shop_id.to_numpy():
    shopv.append(shop_p[shops])
for itm in test.item_id.to_numpy():
    catv.append(cat_p[catindex[itm]])
In [1143]:
test['shop_pred']=shopv
test['cat_pred']=catv
In [1144]:
test.head()
Out[1144]:
ID shop_id item_id date_block_num item_cnt_day shop_pred cat_pred
0 0 5 5037 33.0 0.0 1451.365852 2668.561039
1 1 5 5320 33.0 0.0 1451.365852 6500.933384
2 2 5 5233 33.0 1.0 1451.365852 2668.561039
3 3 5 5232 33.0 0.0 1451.365852 4228.517418
4 4 5 5268 33.0 0.0 1451.365852 11717.625898
In [1147]:
traingroup.to_csv('traindata.csv')
test.to_csv('testdata.csv')

Now we have collected and/or synthesized six features; with these features we will predict each item's sales for the upcoming month

In [2]:
traingroup=pd.read_csv('traindata.csv')
test=pd.read_csv('testdata.csv')
traingroup.head()
Out[2]:
Unnamed: 0 shop_id item_id date_block_num item_cnt_day shop_pred cat_pred true_nxt_month
0 0 0 30 1 31.0 6676.0 29809.0 0.0
1 1 0 31 1 11.0 6676.0 6520.0 0.0
2 2 0 32 0 6.0 5578.0 33489.0 10.0
3 3 0 32 1 10.0 6676.0 29809.0 0.0
4 4 0 33 0 3.0 5578.0 6094.0 3.0
In [3]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error

Data-label separation

In [26]:
X, y = traingroup.iloc[:,1:-1],traingroup.iloc[:,-1]
In [27]:
data_dmatrix = xgb.DMatrix(data=X,label=y)

Training set : Validation set = 8 : 2

In [33]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)   

Using an XGBoost regression model

In [7]:
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.1, max_depth = 5, alpha = 10, n_estimators = 10)
In [8]:
xg_reg.fit(X_train,y_train)

preds = xg_reg.predict(X_test)
In [9]:
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
RMSE: 7.867915

Hyperparameter tuning

In [19]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
a=[0.01, 0.02, 0.05, 0.1, 0.2, 0.5,1,1.5]
for i in a:
    xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = i, max_depth = 5, alpha = 10, n_estimators = 10)
    xg_reg.fit(X_train,y_train)
    preds = xg_reg.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print("Learning_rate: "+str(i)+"   RMSE: %f" % (rmse))
Learning_rate: 0.01   RMSE: 9.085035
Learning_rate: 0.02   RMSE: 8.908416
Learning_rate: 0.05   RMSE: 8.434721
Learning_rate: 0.1   RMSE: 7.867915
Learning_rate: 0.2   RMSE: 7.143881
Learning_rate: 0.5   RMSE: 6.483968
Learning_rate: 1   RMSE: 6.542184
Learning_rate: 1.5   RMSE: 7.132889
In [21]:
a=[1,2,5,10,20,50,100]
for i in a:
    xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.5, max_depth = 5, alpha = 10, n_estimators = i)
    xg_reg.fit(X_train,y_train)
    preds = xg_reg.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print("Trees: "+"{0:03d}".format(i)+"   RMSE: %f" % (rmse))
Trees: 001   RMSE: 9.239887
Trees: 002   RMSE: 7.463247
Trees: 005   RMSE: 7.289360
Trees: 010   RMSE: 6.483968
Trees: 020   RMSE: 6.411767
Trees: 050   RMSE: 6.068529
Trees: 100   RMSE: 5.904932
In [22]:
a=[3,4,5,6,7]
for i in a:
    xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.5, max_depth = i, alpha = 10, n_estimators = 10)
    xg_reg.fit(X_train,y_train)
    preds = xg_reg.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print("Depth: "+"{0:02d}".format(i)+"   RMSE: %f" % (rmse))
Depth: 03   RMSE: 6.661335
Depth: 04   RMSE: 6.618984
Depth: 05   RMSE: 6.483968
Depth: 06   RMSE: 6.424804
Depth: 07   RMSE: 6.389279
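The three separate sweeps above can be combined into one grid search. A sketch using scikit-learn's `GridSearchCV` on synthetic data, with `GradientBoostingRegressor` as a stand-in; the same call accepts `xgb.XGBRegressor`, since it implements the scikit-learn estimator interface:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for (X_train, y_train).
rng = np.random.RandomState(123)
Xs = rng.rand(200, 6)
ys = Xs @ rng.rand(6) + 0.1 * rng.randn(200)

# One pass over learning rate, tree count, and depth; score is negated RMSE.
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=123),
    param_grid={'learning_rate': [0.1, 0.5],
                'n_estimators': [10, 50],
                'max_depth': [3, 5]},
    scoring='neg_root_mean_squared_error',
    cv=3,
)
grid.fit(Xs, ys)
print(grid.best_params_)
```

Unlike the manual loops, this also cross-validates each combination instead of scoring on a single held-out split.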
In [34]:
xg_reg = xgb.XGBRegressor(objective ='reg:squarederror', colsample_bytree = 0.3, learning_rate = 0.5, max_depth = 10, alpha = 10, n_estimators = 100)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, preds))
print("RMSE: %f" % (rmse))
RMSE: 6.283186

10-fold cross validation

In [11]:
params = {"objective":"reg:squarederror",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=10,
                    num_boost_round=50,early_stopping_rounds=10,metrics="rmse", as_pandas=True, seed=123)
[03:21:13] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
(the same warning is printed once for each of the 10 folds)
In [15]:
cv_results.tail(10) #cross validation stays stable at each iteration
Out[15]:
train-rmse-mean train-rmse-std test-rmse-mean test-rmse-std
40 4.419207 0.231000 4.789951 1.321249
41 4.404774 0.233470 4.779454 1.320505
42 4.390292 0.236632 4.768963 1.323026
43 4.375749 0.243222 4.761392 1.321850
44 4.352870 0.243277 4.753761 1.323092
45 4.314009 0.225754 4.732972 1.303788
46 4.297184 0.221828 4.728216 1.300894
47 4.275600 0.222790 4.716102 1.301457
48 4.248579 0.210352 4.706685 1.290166
49 4.234774 0.206244 4.697631 1.280220

As it turns out, my two market-forecast features (shop_pred and cat_pred) contribute to the model

In [35]:
plt.rcParams['figure.figsize'] = [20, 10]
xgb.plot_importance(xg_reg)
plt.show()
In [39]:
result = xg_reg.predict(test.iloc[:,2:])
In [40]:
result
Out[40]:
array([ 0.9663857 ,  1.5607779 ,  0.44546717, ..., -0.0976705 ,
        0.06393787,  0.22638461], dtype=float32)
In [50]:
plt.rcParams['font.size'] = 15
plt.hist(result,range=(0,50),bins=50);

The competition rules say the results are to be clipped to [0, 20]

In [67]:
result=np.asarray(result)
result=result*(result>0)
result=result*(result<20)+(result>=20)*20
plt.hist(result,range=(0,50),bins=50);
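The two masking steps are equivalent to a single `np.clip`; a quick check on a toy array standing in for the predictions:

```python
import numpy as np

result = np.array([-0.5, 3.2, 25.0])                       # stand-in predictions
clipped = result * (result > 0)                            # zero out negatives
clipped = clipped * (clipped < 20) + (clipped >= 20) * 20  # cap at 20
assert np.allclose(clipped, np.clip(result, 0, 20))
```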

Creation of submission file

In [71]:
submission=pd.read_csv('sample_submission.csv')
In [73]:
submission['item_cnt_month']=result
In [78]:
submission.to_csv('my_submission.csv',index=False)